Abstract. Semiparametric Bayesian methods have been proposed in
the literature for growth curve modeling to reduce the adverse effect 
of having nonnormal data. The normality assumption of measurement 
errors in traditional growth curve models was replaced by a random 
distribution with Dirichlet process mixture priors. However, both 
the random effects and measurement errors are equally likely to be 
nonnormal. Therefore, in this study, three types of robust distributional 
growth curve models are proposed from a semiparametric Bayesian 
perspective, in which random coefficients or measurement errors follow 
either normal distributions or unknown random distributions with 
Dirichlet process mixture priors. Based on a Monte Carlo simulation 
study, we evaluate the performance of the robust models and demonstrate 
that selecting an appropriate model for practical data analyses is very 
important, by comparing the three types of robust distributional models 
as well as the traditional growth curve models with the normality 
assumption. We also provide a straightforward strategy to select the 
appropriate model.

1 Introduction


Longitudinal studies help us understand changes. Unlike one-off cross-sectional 
studies that give information about subjects at one point, like a snapshot 
photo, longitudinal studies follow subjects across time, more like a photo 
album. They tell a story of subjects not only at a moment in time, but also 
over time, showing how subjects have changed and what factors have caused 
between-subjects variations in change. Growth curve models are widely used in 
longitudinal research (e.g., McArdle & Nesselroade, 2014) as many longitudinal 
models in social and behavioral sciences, such as multilevel models and linear 
hierarchical models, can be written as a form of growth curve models. In practice, 
traditional growth curve model estimation is based on the assumption that both
random effects and within-subject measurement errors are normally distributed.
However, data in social and behavioral sciences are rarely normal and may be 
contaminated by outliers (Cain et al., 2017; Micceri, 1989). Because ignoring 
the nonnormality of data may lead to imprecise or even inaccurate parameter 
estimates and misleading statistical inferences (e.g., Maronna et al., 2006; Yuan 
& Bentler, 2001), and routine methods, such as deleting the outliers, may lead to 
problems such as resulting inferences failing to reflect uncertainty and reduced 
efficiency (e.g., Lange et al., 1989; Yuan & Bentler, 2002), researchers have 
developed robust methods to obtain reliable parameter estimation and statistical 
inference. 


Robust methods are generally of two types. The first type
is to assign a weight to each subject in a dataset according to its distance
from the center of the majority of the data, aiming to downweight potential
outlying observations (e.g., Pendergast & Broffitt, 1985; Silvapulle, 1992; Singer 
& Sen, 1986; Yuan & Bentler, 1998; Zhong & Yuan, 2010). The second type 
is to use certain nonnormal distributions that are mathematically tractable, 
instead of normal distributions, to model data distributions. Both types of robust 
methods have been directly applied to growth curve modeling. For example, on 
the one hand, Pendergast & Broffitt (1985) and Singer & Sen (1986) proposed 
robust estimators based on M-methods for growth curve models with elliptically 
symmetric errors, and Silvapulle (1992) further extended the M-method to allow 
asymmetric errors for growth curve analysis. Yuan & Zhang (2012) developed 
a two-stage robust procedure for structural equation modeling with nonnormal 
missing data and applied the procedure to growth curve modeling. On the other 
hand, latent variables and/or measurement errors were assumed to follow a t or 
skew-t distribution (Tong & Zhang, 2012; Zhang, 2016) or a mixture of certain 
distributions (Lu & Zhang, 2014; Muthén & Shedden, 1999). While being useful, 
these methods still have limitations under certain conditions. For example, 
the downweighting method did not perform well when latent variables contain 
extreme scores (e.g., see simulation results in Zhong & Yuan, 2011). Using a t 
distribution or a mixture of normal distributions still imposed restrictions on 
the shape of the data distribution. 


Semiparametric Bayesian methods, also referred to as nonparametric
Bayesian methods, can address these issues because they relax the normality
assumptions in a flexible way. Semiparametric Bayesian modeling relies on a
building block, the Dirichlet process (DP), which is a distribution over
probability measures that can be used to estimate unknown distributions.
Therefore, the nonnormality issue can be addressed by directly estimating the
unknown random distributions of latent variables or measurement errors (i.e.,
obtaining the posteriors of the distributions). The advantages of using
semiparametric Bayesian methods have been discussed in the literature (e.g.,
Fahrmeir & Raach, 2007; Ghosal et al., 1999; Hjort, 2003; Hjort et al., 2010;
MacEachern, 1999; Müller & Mitra, 2004). First, they do not constrain models
to a specific parametric form that may limit the scope and type of statistical
inferences in many situations. Second, they can provide full probability models
for the data-generating process and lead to analytically tractable posterior
distributions.

Because of their flexibility and adaptivity, semiparametric Bayesian methods 
have been applied to various models. Bush & MacEachern (1996), Kleinman 
& Ibrahim (1998), and Brown & Ibrahim (2003) used DP mixtures to handle 
nonnormal random effects. Burr & Doss (2005) used a conditional DP to handle 
heterogeneous effect sizes in meta-analysis. Ansari & Iyengar (2006) included 
Dirichlet components to build a semiparametric recurrent choice model. Si & 
Reiter (2013) used DP mixtures of multinomial distributions for categorical 
data with missing values. Semiparametric Bayesian methods have also been 
applied to structural equation modeling to relax the normality assumption 
of the latent variables (e.g., Lee et al., 2008; Yang & Dunson, 2010). Tong 
& Zhang (2019) directly used a DP mixture to model nonnormal data in 
growth curve modeling. Tong & Zhang (2019) showed that semiparametric
Bayesian methods outperformed traditional growth curve modeling as well as
the Student’s t-distribution-based robust method when data were not normal.
However, in that study the nonnormal data were generated with nonnormally
distributed measurement errors, and only the measurement errors were modeled
using semiparametric Bayesian methods. In practice, it is possible that random
effects also violate the normality assumption. To account for this issue, we
need to also model random effects semiparametrically.

Therefore, in this study, three different types of robust distributional growth 
curve models are proposed from a semiparametric Bayesian perspective. The 
features of these three types of models as well as the traditional growth curve
model are also discussed. In the next two sections, after introducing the idea of
semiparametric Bayesian modeling, we introduce three types of semiparametric 
Bayesian growth curve models. Then, we compare the three types of models and 
the traditional model in modeling different types of data through simulation 
studies. Recommendations are provided at the end of the article. 


2 Semiparametric Bayesian Modeling with DP Priors 


A typical motivation for using semiparametric Bayesian methods is that one is
unwilling to make unverified assumptions about the distributions of latent
variables or measurement errors, as is done in parametric modeling. Under a
semiparametric perspective, we model the distribution of a random vector
$\epsilon$ using a random distribution function $G$ with a prior $\mathcal{G}$.
Namely, the traditional parametric assumption on the random vector $\epsilon$
(i.e., $\epsilon \sim N(\mu_\epsilon, \Sigma_\epsilon)$) is replaced by

$$\epsilon \sim G, \qquad G \sim \mathcal{G},$$

where $G$ is an unknown distribution function and $\mathcal{G}$ is its prior, a
distribution over the distribution $G$. The prior $\mathcal{G}$ can be chosen as
the Dirichlet process (DP; Ferguson, 1973,7), which is the first prior defined
for spaces of distribution functions and is the most widely used one. The
Dirichlet process generates a random distribution function $G$ such that, for
any measurable partition $P_1, \ldots, P_k$ of the sample space $\mathcal{X}$,
$(G(P_1), \ldots, G(P_k))$ follows a Dirichlet distribution
$\text{Dirichlet}(\alpha G_0(P_1), \ldots, \alpha G_0(P_k))$, where $\alpha$ and
$G_0$ are the parameters of the DP. For example, if $\mathcal{X}$ is the real
line and $P = (-\infty, x]$, where $x$ is a real number, then

$$G(x) \sim \text{Dirichlet}(\alpha G_0(x), \alpha(1 - G_0(x))).$$


Thus,

$$E(G(x)) = G_0(x), \qquad \text{Var}(G(x)) = \frac{G_0(x)(1 - G_0(x))}{\alpha + 1}.$$


The DP is characterized by the two parameters, $\alpha$ and $G_0$. $G_0$ is a
base distribution, which represents the central or “mean” distribution in the
distribution space, while the precision parameter $\alpha$ governs how close
realizations of $G$ are to $G_0$. For example, Figure 1 displays random
distributions generated from the Dirichlet process given $G_0$ and different
values of $\alpha$. The red lines in the four plots represent the cumulative
distribution function of the base distribution $G_0$, which is a standard normal
distribution in this case. Black lines in each plot represent realizations of
$G$ generated from the Dirichlet process in five replications given $G_0$ and
$\alpha$. Clearly, as $\alpha$ increases, the generated realizations of $G$ are
closer to $G_0$.

Ferguson (1973) introduced the DP as a random probability measure that has
two desirable properties: (1) its support is sufficiently large, and (2) the
posterior distribution is analytically manageable. He explained that the
Dirichlet process is a conjugate prior and the posterior of $G$ is
$DP(\tilde{\alpha}, \tilde{G}_0)$. The two parameters are
$\tilde{\alpha} = \alpha + N$ and

$$\tilde{G}_0 = \frac{\alpha}{\alpha + N} G_0 + \frac{N}{\alpha + N} G_N,$$

where $G_N$ is the empirical distribution function of the data. Thus, the
posterior point estimate of $G$, $E(G \mid \text{data}) = \tilde{G}_0$, is a
weighted average of two distributions: $G_0$ and $G_N$. If $\alpha = 0$, the
posterior point estimate is $G_N$, which is nonparametric. When $\alpha$
approaches infinity, the posterior point estimate approaches $G_0$, which is
parametric. In practice, $\alpha \sim \text{Gamma}(a_1, a_2)$, which is neither
0 nor infinity. Thus, we consider the posterior point estimate of $G$ as
semiparametric.
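The weighted-average form of the posterior point estimate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base distribution is taken to be standard normal, the five data values are made up, and the function names (`normal_cdf`, `empirical_cdf`, `dp_posterior_mean_cdf`) are hypothetical helpers, not part of any package used in the study.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """CDF of N(mean, sd^2); plays the role of the base distribution G0."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def empirical_cdf(x, data):
    """Empirical distribution function G_N evaluated at x."""
    return sum(1 for d in data if d <= x) / len(data)


def dp_posterior_mean_cdf(x, data, alpha, base_cdf=normal_cdf):
    """E(G(x) | data) = alpha/(alpha+N) * G0(x) + N/(alpha+N) * G_N(x)."""
    n = len(data)
    w0 = alpha / (alpha + n)  # weight on the parametric base distribution
    return w0 * base_cdf(x) + (1.0 - w0) * empirical_cdf(x, data)


# Made-up data, for illustration only.
data = [-0.3, 0.1, 0.4, 2.5, 3.1]
for alpha in (0.0, 5.0, 1e9):
    print(alpha, dp_posterior_mean_cdf(0.5, data, alpha))
```

With `alpha = 0` the estimate equals the empirical distribution function, and with a very large `alpha` it is essentially the base distribution, which mirrors the two limiting cases described above.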


2.1 Stick-breaking construction 


Sethuraman (1994) developed a constructive way of forming $G$, known as
“stick-breaking”, and showed that draws from stick-breaking are indeed DP
distributed under very general conditions. Let
$q_1, q_2, \ldots, q_k, \ldots \sim \text{Beta}(1, \alpha)$. Define

$$p_k = q_k \prod_{j=1}^{k-1} (1 - q_j).$$

Then,

$$G = \sum_{k=1}^{\infty} p_k \delta_{\epsilon_k^*},$$

where $\delta_{\epsilon_k^*}$ is the Dirac probability measure at
$\epsilon_k^*$ and $\epsilon_k^* \sim G_0$. It is important to note that
$\sum_{k=1}^{\infty} p_k = 1$, as it guarantees $G$ to be a distribution.

Figure 1. Random distributions generated from the Dirichlet process in five
replications given a standard normal base distribution and different values of
$\alpha$


The process of the stick-breaking construction is given below.
1. Draw $\epsilon_1^*$ from $G_0$;

2. Draw $q_1$ from $\text{Beta}(1, \alpha)$, then $p_1 = q_1$;

3. Draw $\epsilon_2^*$ from $G_0$;

4. Draw $q_2$ from $\text{Beta}(1, \alpha)$, then $p_2 = q_2(1 - q_1)$;

Therefore, the distribution $G(\cdot)$ is a discrete distribution as

$$G(\cdot) = \begin{cases} \epsilon_1^*, & p = p_1 \\ \epsilon_2^*, & p = p_2 \\ \vdots & \\ \epsilon_k^*, & p = p_k \\ \vdots & \end{cases}$$
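The stick-breaking steps can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the infinite sequence is cut off after `n_atoms` draws, the base distribution is assumed to be standard normal, and `stick_breaking` is a hypothetical helper name.

```python
import random


def stick_breaking(alpha, base_draw, n_atoms, rng):
    """Return the first n_atoms weights and atoms of one realization of G.

    The infinite sum is cut off at n_atoms, so the weights add up to slightly
    less than 1; the shortfall is the length of the stick not yet broken.
    """
    weights, atoms = [], []
    remaining = 1.0  # stick length left so far: prod_{j<k} (1 - q_j)
    for _ in range(n_atoms):
        q = rng.betavariate(1.0, alpha)  # q_k ~ Beta(1, alpha)
        weights.append(q * remaining)    # p_k = q_k * prod_{j<k} (1 - q_j)
        atoms.append(base_draw(rng))     # atom drawn from the base distribution
        remaining *= 1.0 - q
    return weights, atoms


rng = random.Random(1)
weights, atoms = stick_breaking(
    alpha=2.0,
    base_draw=lambda r: r.gauss(0.0, 1.0),  # assumed standard normal base
    n_atoms=200,
    rng=rng,
)
print(sum(weights))
```

Because each break removes a random fraction of what is left, the weights decay quickly, and a few hundred atoms already carry essentially all of the probability mass for moderate values of the precision parameter.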


To define a continuous distribution, the Dirichlet process can be used as the
basis of a mixture model, for example, a mixture of $N(\mu_k, \sigma_k^2)$ with
mixing proportions defined by $p_k$. Theoretically, there are an infinite number
of mixture components, as $k = 1, \ldots, \infty$, giving an arbitrarily flexible
choice of distributional shapes. Multimodal or heavy-tailed distributions can be
naturally modeled in this way. In practice, a finite number of mixture
components would be good enough, and this number is taken into account by the
Dirichlet process. Smaller values of the DP precision parameter $\alpha$ result
in a smaller number of mixture components.
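The link between the precision parameter and the number of mixture components that actually show up in a sample can be checked with a small simulation. The sketch below is illustrative only: the truncation level of 100 components, the base distribution for the component means and standard deviations, and the helper names are all assumptions made for this example.

```python
import random


def draw_from_dp_mixture(n, alpha, n_atoms, rng):
    """Draw n values from one realization of a truncated DP mixture of normals.

    Returns the values and the number of distinct components actually used.
    """
    # Stick-breaking weights, truncated at n_atoms components.
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        q = rng.betavariate(1.0, alpha)
        weights.append(q * remaining)
        remaining *= 1.0 - q
    # Component parameters from an assumed base distribution.
    means = [rng.gauss(0.0, 3.0) for _ in range(n_atoms)]
    sds = [rng.uniform(0.5, 1.5) for _ in range(n_atoms)]
    # rng.choices treats the weights as relative, so no renormalization is needed.
    labels = rng.choices(range(n_atoms), weights=weights, k=n)
    values = [rng.gauss(means[k], sds[k]) for k in labels]
    return values, len(set(labels))


def mean_components(alpha, reps=200, n=100, seed=0):
    """Average number of occupied components over repeated realizations."""
    rng = random.Random(seed)
    total = sum(draw_from_dp_mixture(n, alpha, 100, rng)[1] for _ in range(reps))
    return total / reps


for alpha in (0.5, 2.0, 10.0):
    print(alpha, mean_components(alpha))
```

In runs of this kind the average number of occupied components grows with the precision parameter, which is the behavior described above.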


3 Three Types of Semiparametric Bayesian Growth Curve Models


Consider a longitudinal dataset with $N$ subjects and $T$ measurement occasions.
Let $y_i = (y_{i1}, \ldots, y_{iT})'$ be a $T \times 1$ random vector with
$y_{ij}$ being an observation from subject $i$ at time $j$
($i = 1, \ldots, N$; $j = 1, \ldots, T$). A typical growth curve model can be
written as

$$y_i = \Lambda b_i + e_i,$$
$$b_i = \beta + u_i,$$

where $\Lambda$ is a $T \times q$ factor loading matrix that determines the
growth curves, $b_i$ is a $q \times 1$ vector of random effects, and $e_i$ is a
vector of measurement errors. The vector of random effects $b_i$ varies around
its mean $\beta$. The residual vector $u_i$ represents the deviation of $b_i$
from $\beta$. When

$$\Lambda = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & T-1 \end{pmatrix}, \quad b_i = \begin{pmatrix} L_i \\ S_i \end{pmatrix}, \quad \text{and} \quad \beta = \begin{pmatrix} \beta_L \\ \beta_S \end{pmatrix},$$

the model is reduced to a linear growth curve model with random intercept $L_i$
and random slope $S_i$. The mean intercept and slope are denoted as $\beta_L$
and $\beta_S$, respectively.
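As a concrete illustration of the linear growth curve model, the sketch below builds the loading matrix and generates one subject's data under normal random effects and normal errors. Several choices here are assumptions made only for the example: the numeric values of the fixed effects, the random-effect covariance matrix and the error variance, the simplification that errors are independent with equal variance across occasions, and the helper names.

```python
import math
import random


def loading_matrix(T):
    """Loading matrix for linear growth: a column of 1s and time scores 0..T-1."""
    return [[1.0, float(t)] for t in range(T)]


def draw_bivariate_normal(cov, rng):
    """Draw from a bivariate normal with mean 0 using a 2 x 2 Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(max(cov[1][1] - l21 ** 2, 0.0))
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [l11 * z1, l21 * z1 + l22 * z2]


def generate_nn_subject(beta, psi, sigma2_e, T, rng):
    """One subject's observations: y = Lambda (beta + u) + e."""
    lam = loading_matrix(T)
    u = draw_bivariate_normal(psi, rng)      # deviations of intercept and slope
    b = [beta[0] + u[0], beta[1] + u[1]]     # subject-specific intercept and slope
    sd_e = math.sqrt(sigma2_e)               # assumes independent, equal-variance errors
    return [row[0] * b[0] + row[1] * b[1] + rng.gauss(0.0, sd_e) for row in lam]


rng = random.Random(2024)
beta = (6.2, 0.3)                    # illustrative mean intercept and slope
psi = [[1.0, 0.3], [0.3, 1.0]]       # illustrative covariance of the random effects
y = generate_nn_subject(beta, psi, sigma2_e=0.5, T=4, rng=rng)
print(y)
```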

Traditionally, $e_i$ and $u_i$ are assumed to follow multivariate normal
distributions with mean vectors of zero and covariance matrices $\Phi$ and
$\Psi$, respectively, so $e_i \sim MN_T(0, \Phi)$ and $u_i \sim MN_q(0, \Psi)$,
where $MN$ denotes a multivariate normal distribution and its subscript
indicates its dimension. Although traditional growth curve models are widely
used, they can be deficient because practical data often violate the normality
assumption. Tong & Zhang (2019) proposed to model $e_i$ using semiparametric
Bayesian methods to account for the nonnormality of data. However, since the
nonnormality of a growth curve model may come from two sources, the measurement
errors $e_i$ and the random components $u_i$ (Pinheiro et al., 2001), we model
either one or both of them semiparametrically and propose three types of robust
distributional growth curve models. The first type of robust semiparametric
Bayesian growth curve model is the same as what Tong & Zhang (2019) proposed:
we let $e_i \sim G_e$, $G_e \sim DP$ and keep $u_i \sim MN_q(0, \Psi)$. The
second type of robust growth curve model can be derived by keeping
$e_i \sim MN_T(0, \Phi)$ and letting $u_i \sim G_u$, $G_u \sim DP$. The third
type of robust growth curve model can be obtained by letting $e_i \sim G_e$,
$G_e \sim DP$ and $u_i \sim G_u$, $G_u \sim DP$. We denote the three types of
robust growth curve models as the Semi-N distributional model, the N-Semi
distributional model, and the Semi-Semi distributional model, respectively.
Similarly, we also denote the traditional growth curve model as the N-N
distributional model.


3.1 Implementation: truncated stick-breaking construction 


3.1.1 Semi-N distributional model. In the Semi-N distributional model,
we assume that $e_i \sim G_e$, where $G_e$ is an unknown random distribution
that is determined by the data. Because the distribution of $e_i$ is continuous,
a DP mixture (DPM) can be used to model the measurement errors such that

$$G_e = \begin{cases} D(\mu_e^{(1)}, \Phi^{(1)}), & \text{with } p = p_1 \\ D(\mu_e^{(2)}, \Phi^{(2)}), & \text{with } p = p_2 \\ \vdots & \\ D(\mu_e^{(k)}, \Phi^{(k)}), & \text{with } p = p_k \\ \vdots & \end{cases}$$

where $D$ represents a predetermined multivariate distribution (e.g.,
multivariate normal, t, multinomial, etc.), and $\mu_e^{(k)}$ and $\Phi^{(k)}$,
$k = 1, \ldots, \infty$, are the means and covariances of the multivariate
distribution in the $k$th component with probability $p_k$. Tong & Zhang (2019)
proposed that

$$e_i \mid \Phi_i \sim MN_T(0, \Phi_i),$$
$$\Phi_i \mid G \sim G,$$
$$G \sim DP(\alpha, G_0).$$


That is, the unknown distribution $G_e$ is approximated by a mixture of
multivariate normal distributions where the mixing measure has a Dirichlet
process prior, $G_e \sim DPM$. The DP prior $DP(\alpha, G_0)$ can be obtained
using the truncated stick-breaking construction (e.g., Lunn et al., 2013;
Sethuraman, 1994). Specifically,
$DP(\cdot) = \sum_{j=1}^{C} p_j \delta_{z_j}(\cdot)$, $1 < C < \infty$, where
$C$ ($1 < C < N$, often set at a large number) is a possible maximum number of
mixture components, $\delta_{z_j}(\cdot)$ denotes a point mass at $z_j$, and
$z_j \sim G_0$ independently. The random weights $p_j$ can be generated through
the following procedure. With $q_1, q_2, \ldots, q_C \sim \text{Beta}(1, \alpha)$,
define

$$p_j' = q_j \prod_{k=1}^{j-1} (1 - q_k), \quad j = 1, \ldots, C.$$

Then, $p_j$ is obtained by

$$p_j = \frac{p_j'}{\sum_{k=1}^{C} p_k'}$$

to satisfy that $\sum_{j=1}^{C} p_j = 1$.
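The truncated construction, including the renormalization step and the resulting cluster assignment of subjects, can be sketched as follows. The precision value, the truncation level, and the helper names are assumptions chosen for illustration; the sketch is not the OpenBUGS code used in the study.

```python
import random


def truncated_stick_weights(alpha, C, rng):
    """Weights from a stick-breaking construction truncated at C components.

    The raw weights are renormalized so that the C weights sum to 1.
    """
    raw, remaining = [], 1.0
    for _ in range(C):
        q = rng.betavariate(1.0, alpha)  # q_j ~ Beta(1, alpha)
        raw.append(q * remaining)        # raw weight before normalization
        remaining *= 1.0 - q
    total = sum(raw)
    return [p / total for p in raw]


def assign_clusters(weights, N, rng):
    """Assign each of N subjects to one of the components."""
    return rng.choices(range(len(weights)), weights=weights, k=N)


rng = random.Random(11)
C, N = 20, 200                               # illustrative truncation level and sample size
p = truncated_stick_weights(alpha=1.0, C=C, rng=rng)
labels = assign_clusters(p, N, rng)
n_clusters = len(set(labels))                # number of occupied components
print(sum(p), n_clusters)
```

The number of occupied components can never exceed the truncation level, and it is typically much smaller when the precision parameter is small.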

Thus, the distribution of $e_i$ through the truncated stick-breaking
construction is

$$G_e = \begin{cases} MN(\mu_e^{(1)}, \Phi^{(1)}), & \text{with } p = p_1 \\ MN(\mu_e^{(2)}, \Phi^{(2)}), & \text{with } p = p_2 \\ \vdots & \\ MN(\mu_e^{(C)}, \Phi^{(C)}), & \text{with } p = p_C \end{cases}$$

Given that the mean of $e_i$ is 0, we constrain
$\sum_{j=1}^{C} p_j \mu_e^{(j)} = 0$. For simplicity, we follow Tong & Zhang
(2019) and constrain $\mu_e^{(j)}$ to be 0. We use inverse Wishart priors
$p(\Phi^{(j)}) = IW(n_0, W_0)$ for the covariance matrices of the mixture
components, $\Phi^{(j)}$, $j = 1, \ldots, C$. Following Lunn et al. (2013, page
294), we fix the shape parameter $n_0$ at a specific number and assign an
inverse Wishart prior to the scale matrix $W_0$. With such a specification, the
measurement error for subject $i$, $e_i$, has a $p_j$ probability of coming from
the mixing component $MN(0, \Phi^{(j)})$. If $e_i$, $i = 1, \ldots, N$, are from
$K_e$ different distributions among $MN(0, \Phi^{(j)})$, $j = 1, \ldots, C$,
$K_e$ is called the number of clusters for $e_i$. Clearly, $K_e \le C$, and
within each cluster, the $e_i$s come from the same distribution.

Bayesian methods are applied to estimate the model. The key idea of
Bayesian methods is to compute the posterior distributions for model parameters
by combining the likelihood function and the priors. Recall that in the
traditional N-N distributional growth curve model, $\beta$, $\Phi$, and $\Psi$
are the model parameters. Here in the Semi-N model, $\beta$ and $\Psi$ are still
model parameters and can be estimated in the same way. However, instead of
estimating $\Phi$ as in the N-N model, we obtain $e_i$ and $K_e$. The estimate
of $K_e$ indicates the heterogeneity of between-subject measurement errors
$e_i$. With a larger value of $K_e$, we are more confident to conclude that
different subjects’ measurement errors are distributed differently. To obtain an
estimate of $\Phi$ (the covariance matrix of $e_i$), we let $e_{i(s)}$,
$i = 1, \ldots, N$, be the observations of $e_i$ simulated from the posterior
distribution in the $s$th Gibbs sampler iteration, and let $\Phi_{(s)}$ be the
corresponding sample covariance matrix. An estimate of $\Phi$ can be taken as
the mean of $\Phi_{(s)}$, averaging over all the Gibbs sampler iterations after
the burn-in period.
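The averaging step for the error covariance matrix can be written out as a short post-processing routine. The sketch below assumes that the sampled error vectors are stored as a nested list indexed by iteration and subject, and that the sample covariance uses the divisor N - 1; both are assumptions for illustration, as are the function names and the toy draws.

```python
def sample_covariance(rows):
    """Sample covariance matrix (divisor N - 1) of equal-length vectors."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [
        [sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
         for b in range(dim)]
        for a in range(dim)
    ]


def estimate_covariance(draws, burn_in):
    """Average the per-iteration sample covariance matrices after burn-in.

    draws[s][i] is the vector sampled for subject i at iteration s.
    """
    kept = draws[burn_in:]
    dim = len(kept[0][0])
    total = [[0.0] * dim for _ in range(dim)]
    for iteration in kept:
        cov = sample_covariance(iteration)
        for a in range(dim):
            for b in range(dim):
                total[a][b] += cov[a][b]
    return [[v / len(kept) for v in row] for row in total]


# Toy draws: 3 iterations, 2 subjects, 2 occasions (made up for illustration).
draws = [
    [[1.0, 0.0], [-1.0, 0.0]],
    [[2.0, 0.0], [-2.0, 0.0]],
    [[0.0, 1.0], [0.0, -1.0]],
]
print(estimate_covariance(draws, burn_in=1))
```

The same routine applies unchanged to sampled random-effect deviations when the covariance matrix of the random effects is the quantity of interest.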
3.1.2 N-Semi distributional model. In the N-Semi model, $u_i$ follows an
unknown distribution $G_u$ with a Dirichlet process prior. We can obtain the
mixing proportions $p_k$ and construct the distribution $G_u$ in a similar way
as in the Semi-N model:

$$G_u = \begin{cases} MN(\mu_u^{(1)}, \Psi^{(1)}), & p = p_1 \\ MN(\mu_u^{(2)}, \Psi^{(2)}), & p = p_2 \\ \vdots & \\ MN(\mu_u^{(C)}, \Psi^{(C)}), & p = p_C \end{cases}$$

where $\mu_u^{(k)}$ and $\Psi^{(k)}$, $k = 1, \ldots, C$, are parameters of the
multivariate normal distribution in the $k$th component. Since $u_i$ represents
the random component of the random effects $b_i$, it is also reasonable to set
$\mu_u^{(k)} = 0$. For the covariance matrices of the mixture components,
$\Psi^{(k)}$, inverse Wishart priors are used,

$$p(\Psi^{(k)}) = IW(m_0, V_0),$$

where $m_0$ and $V_0$ are hyperparameters.

Therefore, $u_i$ comes from $MN(0, \Psi^{(k)})$ with probability $p_k$. The
number of clusters for $u_i$ is denoted by $K_u$. Within each cluster, the
$u_i$s come from the same distribution.

In contrast to the N-N and Semi-N distributional growth curve models, in the
N-Semi model, we obtain $u_i$ and $K_u$ in the Markov chain Monte Carlo (MCMC)
procedure instead of estimating $\Psi$, while the fixed effects $\beta$ and the
covariance matrix of measurement errors $\Phi$ are still model parameters and
estimated in the same way. The estimate of $K_u$ indicates the heterogeneity of
random effects for different subjects. If $K_u$ is large, we are more confident
to conclude that different subjects have different growth trajectories. To
obtain an estimate of $\Psi$ (the covariance matrix of $u_i$), we let
$u_{i(s)}$, $i = 1, \ldots, N$, be the observations of $u_i$ simulated from the
posterior distribution in the $s$th Gibbs sampler iteration, and let
$\Psi_{(s)}$ be the corresponding sample covariance matrix. An estimate of
$\Psi$ is the mean of $\Psi_{(s)}$, averaging over all the Gibbs sampler
iterations after the burn-in period. For the linear growth curve model, the
estimate $\hat{\Psi}$ is a $2 \times 2$ matrix
$((\hat{\sigma}_L^2, \hat{\sigma}_{LS})', (\hat{\sigma}_{LS}, \hat{\sigma}_S^2)')$.
The significance of $\hat{\sigma}_L^2$ and $\hat{\sigma}_S^2$ implies the
existence of between-subject differences in the initial level and the rate of
change, respectively. A significant $\hat{\sigma}_{LS}$ means that the initial
level and the rate of change are significantly correlated.


3.1.3 Semi-Semi distributional model. In the Semi-Semi model, $e_i$ and
$u_i$ follow unknown distributions $G_e$ and $G_u$, respectively. The two
distributions can be constructed in the same way as in the Semi-N and N-Semi
distributional models. Consequently, we cannot obtain the estimates of $\Phi$
and $\Psi$ directly, but they can be calculated following the same procedure
as discussed in previous sections, and be interpreted likewise. Besides $\Phi$
and $\Psi$, other model parameters include $\beta$, $K_e$, and $K_u$, which can
be estimated explicitly in the MCMC procedure. The fixed effect $\beta$
represents the average initial level and rate of change for all subjects. The
number of clusters for $e_i$ and the number of clusters for $u_i$ are $K_e$ and
$K_u$, indicating the heteroscedasticities of $e_i$ and $u_i$, respectively.


3.2 Visual model comparisons 


To illustrate the differences among the N-N, Semi-N, N-Semi, and Semi-Semi 
distributional models, we generate and plot data from the four types of models 
(Figure 2). For each type of model, data on 50 subjects are generated at four 
occasions assuming a linear growth trend. Figure 2(a) displays the trajectories 
of the data generated from the N-N distributional model. No outlier can be 
observed. The overall trajectory looks clean and smooth. Figure 2(b) plots 
the data generated from the Semi-N distributional model with nonnormal 
measurement errors and normal random effects. Noticeably, some observations 
stand out of the overall trajectory such as those labeled by 1 and 2. A close 
look at the two observations reveals that the reason why they deviate from the 
overall trajectory is that they are off their own growth trajectories. Figure 2(c) 
portrays data from the N-Semi distributional model with normal measurement 
errors but nonnormal random effects. Some observations also deviate from the 
overall growth trajectory. However, it seems that those observations are still 
on their own growth trajectories. The reason why they stand out is that the 
rate of growth for the specific case is very different from the majority of cases. 
Figure 2(d) draws the trajectories for the data from the Semi-Semi distributional 
model with both nonnormal errors and random effects. Clearly, the outlying 
observations are due to two sources - the trajectory of a case deviates from 
the overall trajectory and the observation for this specific case is off its own 
trajectory. For example, observation 1 stands out because it is off the trajectory 
of the case and the case itself has a lower initial level and a lower rate of change. 
In summary, Figure 2 suggests that the four types of distributional growth curve 
models can imply very different patterns in growth trajectories. For instance, if 
a subject’s growth trajectory is within the normal range of the overall trajectory 
and an observation at certain times stands out, the data are more likely to come 
from the Semi-N distributional model. If, within a subject, observations follow 
a smooth pattern but the trajectory itself differs from the overall trajectory, the 
data are more likely to come from the N-Semi distributional model. Therefore, 
given an empirical data set, it is very important to specify the correct type of 
growth curve models. In order to concretely demonstrate the possible adverse 
effects of misspecification for finite samples, we conduct a simulation study in 
the next section. 


4 A Simulation Study 


In this simulation study, we aim to evaluate the performance of the three robust
distributional models as well as the traditional N-N model. Moreover, the effects
of the misspecification of the three types of robust distributional growth curve
models will be studied to compare their intrinsic characteristics. We
first generate data from the N-N, Semi-N, N-Semi, and Semi-Semi distributional
models and name the data as N-N data, Semi-N data, N-Semi data, and Semi-Semi
data, respectively. Then, for each type of data, we fit all four types of
models and compare their parameter estimates.

Figure 2. Trajectory plots of data generated from the 4 different types of distributional
growth curve models. Data on 50 subjects are generated for 4 measurement occasions.


We focus on a linear growth curve model as discussed in the previous section,

$$y_i = \Lambda b_i + e_i,$$
$$b_i = \beta + u_i.$$

In the model (see Figure 3), the fixed effects are given by
$\beta = (\beta_L, \beta_S)' = (6.2, 0.3)'$.

Figure 3. Path diagram of a linear growth curve model. The numbers in the path
diagram are population parameter values used in the simulation.
4.1 Study design


In this study, seven possible influential factors are studied (see Table 1): type of
model, type of data, potential number of clusters ($C$), sample size ($N$), number
of measurement occasions ($T$), the covariance between the latent intercept and
slope ($\sigma_{LS}$), and variance of measurement errors ($\sigma_e^2$).

First, four types of distributional growth curve models are considered, 
including the N-N, Semi-N, N-Semi, and Semi-Semi distributional models. 
Second, based on the four types of models, we generate four types of data, called 
N-N data, Semi-N data, N-Semi data, and Semi-Semi data correspondingly. We 
use each one of the four models to fit all four types of data under different 
conditions of the other five influential factors as described below. 

(1) Three different sample sizes are considered: $N$ = 50, 200, and 500.
(2) The number of measurement occasions $T$ is either 3 or 5. (3) For the
semiparametric models, we assume that data are potentially from 5 or 20
different clusters. (4) For the growth curve model parameters, the covariance
between the latent intercept and the slope $\sigma_{LS}$ is either 0 or 0.3,
reflecting uncorrelated and correlated coefficients, respectively. When we
generate $u_i$ from the semiparametric perspective, we simply generate
$\Psi^{(k)} \sim IW(m_0, (m_0 - 2 - 1)\Psi)$, where
$\Psi = ((\sigma_L^2, \sigma_{LS})', (\sigma_{LS}, \sigma_S^2)')$ and the
hyperparameter $m_0 = 4$, so that the mean of $\Psi^{(k)}$ is $\Psi$ and thus
the “mean” of $G_u$ is a distribution with its covariance matrix being $\Psi$.
(5) In practice, it is typical to assume the independence of measurement errors
and the homogeneity of error variances across time, so the within-subject
measurement error structure is usually simplified to $\Phi = \sigma_e^2 I$. The
variance of measurement errors $\sigma_e^2$ is manipulated to be 0.5 or 0.7 to
investigate the influence of measurement errors. When we generate
$e_i = (e_{i1}, \ldots, e_{iT})'$ semiparametrically, we can set
$\Phi^{(k)} \sim IW(n_0, (n_0 - T - 1)\sigma_e^2 I)$. However, in practice, it
is easier to generate $e_{i1}, \ldots, e_{iT}$ separately from a univariate
distribution $N(0, \sigma_e^{2(k)})$. We generate $\sigma_e^{2(k)}$ from
$\sigma_e^{2(k)} \sim IG(c_0, d_0)$, where $c_0 = 2$ and $d_0 = \sigma_e^2$, so
that the mean of $\sigma_e^{2(k)}$ is $d_0/(c_0 - 1) = \sigma_e^2$.

Overall, 768 conditions of simulations are considered. For each condition, a
total of 200 data sets are generated and analyzed in OpenBUGS (Lunn et al.,
2013).


Table 1. Influential factors studied in the simulation study

Factor                        # of factor levels   Levels
Type of model                 4                    N-N model, Semi-N model, N-Semi model, Semi-Semi model
Type of data                  4                    N-N data, Semi-N data, N-Semi data, Semi-Semi data
# of clusters                 2                    5, 20
Sample size                   3                    50, 200, 500
# of measurement occasions    2                    3, 5
Var(measurement errors)       2                    0.5, 0.7
Cov(intercept, slope)         2                    0, 0.3


4.1.1 Pseudo-procedure to generate the Semi-Semi data
1. Set $C$ equal to the number of clusters;
2. Generate $p_{1k}$, $k = 1, \ldots, C$;
3. Generate $\sigma_e^{2(k)} \sim IG(c_0, d_0)$;
4. Generate $p_{2k}$, $k = 1, \ldots, C$;
5. Generate $\Psi^{(k)} \sim IW(m_0, (m_0 - 2 - 1)\Psi)$;
6. For $i$ in $1 : N$, do

(a) Randomly select a cluster based on $p_{1k}$;

(b) If the $k$th cluster is selected in (a), generate
$e_{i1}, \ldots, e_{iT} \sim N(0, \sigma_e^{2(k)})$ and let
$e_i = (e_{i1}, \ldots, e_{iT})'$;

(c) Randomly select a cluster based on $p_{2k}$;

(d) If the $k$th cluster is selected in (c), generate
$u_i \sim MN(0, \Psi^{(k)})$;

(e) Generate $y_i = \Lambda\beta + \Lambda u_i + e_i$.
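The pseudo-procedure can be turned into a runnable sketch using only the Python standard library. Several details are assumptions made for this illustration and are not stated in the procedure: the mixing proportions are generated by truncated stick-breaking with precision 1, the intercept and slope variances are set to 1, and the inverse Wishart draw is obtained by inverting a Wishart draw built from an integer number of normal vectors. All function names are hypothetical.

```python
import math
import random


def stick_weights(alpha, C, rng):
    """Truncated, renormalized stick-breaking weights (assumed mechanism)."""
    raw, remaining = [], 1.0
    for _ in range(C):
        q = rng.betavariate(1.0, alpha)
        raw.append(q * remaining)
        remaining *= 1.0 - q
    total = sum(raw)
    return [p / total for p in raw]


def inv_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def mvn2(cov, rng):
    """Bivariate normal draw with mean 0 via a 2 x 2 Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(max(cov[1][1] - l21 ** 2, 0.0))
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [l11 * z1, l21 * z1 + l22 * z2]


def inv_wishart_2x2(df, scale, rng):
    """Inverse Wishart draw for integer df: invert a Wishart(df, scale^-1) draw."""
    precision = inv_2x2(scale)
    w = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(df):
        z = mvn2(precision, rng)
        for a in range(2):
            for b in range(2):
                w[a][b] += z[a] * z[b]
    return inv_2x2(w)


def generate_semi_semi(N, T, C, beta, psi, sigma2_e, rng, alpha=1.0, c0=2.0, m0=4):
    p1 = stick_weights(alpha, C, rng)                                      # step 2
    # Step 3: inverse gamma draw as the reciprocal of a gamma draw with rate d0.
    s2 = [1.0 / rng.gammavariate(c0, 1.0 / sigma2_e) for _ in range(C)]
    p2 = stick_weights(alpha, C, rng)                                      # step 4
    scale = [[(m0 - 2 - 1) * v for v in row] for row in psi]
    psis = [inv_wishart_2x2(m0, scale, rng) for _ in range(C)]             # step 5
    data = []
    for _ in range(N):                                                     # step 6
        k = rng.choices(range(C), weights=p1)[0]                           # (a)
        e = [rng.gauss(0.0, math.sqrt(s2[k])) for _ in range(T)]           # (b)
        k2 = rng.choices(range(C), weights=p2)[0]                          # (c)
        u = mvn2(psis[k2], rng)                                            # (d)
        # (e): row t of the loading matrix is (1, t), t = 0, ..., T-1.
        data.append([beta[0] + u[0] + t * (beta[1] + u[1]) + e[t] for t in range(T)])
    return data


rng = random.Random(3)
psi = [[1.0, 0.3], [0.3, 1.0]]   # variances of 1 are assumed for illustration
data = generate_semi_semi(N=50, T=4, C=5, beta=(6.2, 0.3), psi=psi, sigma2_e=0.5, rng=rng)
print(data[0])
```

Because the inverse Wishart draws with these hyperparameters are heavy-tailed, individual generated data sets can contain clusters with very large variances, which is part of what makes the resulting data nonnormal.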


4.2 Evaluation Criteria 


We obtain the parameter estimate, bias, relative bias, empirical standard error,
mean square error (MSE), and coverage probability (CP) of the 95% highest
posterior density (HPD) credible intervals¹ for each parameter. Let $\theta$
denote a parameter and also its population value, and let $\hat{\theta}_r$,
$r = 1, \ldots, 200$, denote its estimate from the $r$th simulation
replication. Furthermore, let $\hat{l}_r$ and $\hat{u}_r$ denote the lower and
upper limits of the 95% HPD credible interval for $\theta$, respectively.
Then, the parameter estimate of $\theta$, $\hat{\theta}$, is calculated as the
average of the parameter estimates of the 200 simulation replications,

$$\hat{\theta} = \frac{1}{200} \sum_{r=1}^{200} \hat{\theta}_r.$$

The bias of $\theta$ is $\text{bias}(\theta) = \hat{\theta} - \theta$. The
relative bias of $\theta$ is

$$RB(\theta) = \begin{cases} 100 \times \left( \dfrac{\hat{\theta}}{\theta} - 1 \right), & \theta \neq 0 \\ 100 \times \hat{\theta}, & \theta = 0. \end{cases}$$


Note that the relative bias is rescaled by multiplying by 100. Smaller relative
bias indicates that the point estimate is less biased and thus more accurate.
The empirical standard error is defined by

$$SE(\hat{\theta}) = \sqrt{\frac{1}{199} \sum_{r=1}^{200} (\hat{\theta}_r - \hat{\theta})^2}.$$


The mean square error is calculated by
$MSE(\theta) = \text{bias}(\theta)^2 + SE(\hat{\theta})^2$. The CP is
calculated as

$$CP(\theta) = \frac{\#(\hat{l}_r < \theta < \hat{u}_r)}{200},$$

where $\#(\hat{l}_r < \theta < \hat{u}_r)$ is the total number of replications
with credible intervals covering the true parameter value $\theta$. Good 95%
HPD credible intervals should give coverage probabilities close to 0.95.
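The evaluation criteria can be computed from the replication-level output with a short function. This is a sketch under assumptions: the empirical standard error uses the divisor R - 1, where R is the number of replications, the MSE adds the squared bias and the squared standard error, and the four toy replications below are made up for illustration. The function name `evaluate` is hypothetical.

```python
import math


def evaluate(estimates, lowers, uppers, theta):
    """Summaries over R replications for one parameter with true value theta."""
    R = len(estimates)
    mean_est = sum(estimates) / R
    bias = mean_est - theta
    # Relative bias in percent; falls back to 100 * estimate when theta is 0.
    rel_bias = 100.0 * (mean_est / theta - 1.0) if theta != 0 else 100.0 * mean_est
    # Empirical standard error with divisor R - 1 (an assumption of this sketch).
    se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (R - 1))
    mse = bias ** 2 + se ** 2
    # Share of replications whose interval covers the true value.
    cp = sum(1 for lo, hi in zip(lowers, uppers) if lo < theta < hi) / R
    return {"estimate": mean_est, "bias": bias, "rel_bias": rel_bias,
            "se": se, "mse": mse, "cp": cp}


# Toy replications, for illustration only.
estimates = [0.9, 1.1, 1.0, 1.0]
lowers = [0.7, 0.9, 1.05, 0.8]
uppers = [1.1, 1.3, 1.20, 1.2]
summary = evaluate(estimates, lowers, uppers, theta=1.0)
print(summary)
```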


¹ A posterior credible interval, also called a credible interval or Bayesian
confidence interval, is analogous to the frequentist confidence interval. The
95% HPD credible interval $[l, u]$ satisfies: 1.
$\text{Prob}(l < \theta < u \mid \text{data}) = 0.95$; 2. for
$\theta_1 \in [l, u]$ and $\theta_2 \notin [l, u]$,
$\text{Prob}(\theta_1 \mid \text{data}) > \text{Prob}(\theta_2 \mid \text{data})$.
In general, HPD intervals have the smallest volume in the parameter space of
$\theta$, and numerical methods have to be used to find HPD intervals.
4.3 Results: Part I


In this part, we evaluate the performance of the semiparametric models through
comparing them with the traditional N-N model in parameter estimation.

First, when data are normally distributed, the four models perform equally
well, especially for large sample sizes. For example, Table 2 contains the absolute
bias and the standard errors for the six important model parameters ($\beta_L$,
$\beta_S$, $\sigma_L^2$, $\sigma_S^2$, $\sigma_{LS}$, and $\sigma_e^2$) of the
four distributional models, when data are generated from the N-N model with
$N = 500$, $T = 5$, $C = 20$, $\sigma_{LS} = 0.3$, and $\sigma_e^2 = 0.5$.
Apparently, there is no notable difference in the performance of the four
models. When the sample size is small, the overall pattern does not change much
(see Table 3). For some parameter estimates, the semiparametric models may
slightly outperform the traditional N-N model.


Table 2. Parameter estimation for the four distributional models when data are
generated from the N-N model with $N = 500$, $T = 5$, $C = 20$,
$\sigma_{LS} = 0.3$, and $\sigma_e^2 = 0.5$

                N-N model        Semi-N model     N-Semi model     Semi-Semi model
                AB      SE       AB      SE       AB      SE       AB      SE
$\beta_L$       -0.004  0.049    -0.003  0.049    -0.003  0.050    -0.003  0.049
$\beta_S$       -0.002  0.017    -0.002  0.017    -0.002  0.017    -0.002  0.017
$\sigma_L^2$     0.052  0.090     0.054  0.090     0.051  0.090     0.050  0.089
$\sigma_S^2$     0.017  0.009     0.017  0.009     0.015  0.009     0.015  0.009
$\sigma_{LS}$   -0.025  0.021    -0.024  0.021    -0.026  0.021    -0.026  0.021
$\sigma_e^2$    -0.019  0.015    -0.020  0.015    -0.020  0.015    -0.020  0.015

Note. AB: absolute bias; SE: empirical standard error.











Table 3. Parameter estimation for the four distributional models when data are 
generated from the N-N model with N = 50, T = 3, C = 5, $\sigma_{LS}$ = 0, and 
$\sigma_e^2$ = 0.1 

                    N-N model        Semi-N model     N-Semi model     Semi-Semi model 
Parameter           AB      SE       AB      SE       AB      SE       AB      SE 
$\beta_L$         -0.001   0.157    0.004   0.161    0.001   0.158    0.001   0.158 
$\beta_S$          0.007   0.053    0.005   0.054    0.006   0.053    0.006   0.054 
$\sigma_L^2$       0.025   0.226    0.029   0.230   -0.016   0.221   -0.021   0.221 
$\sigma_S^2$       0.039   0.028    0.037   0.027    0.019   0.028    0.018   0.028 
$\sigma_{LS}$     -0.015   0.057   -0.014   0.056   -0.017   0.055   -0.015   0.055 
$\sigma_e^2$       0.002   0.020    0.005   0.020    0.001   0.020    0.004   0.020 

Note. AB: absolute bias; SE: empirical standard error. 











Next, we evaluate the performance of the four models when data are not 
normally distributed. Specifically, we compare the N-N model to the Semi-N, 
N-Semi and Semi-Semi models in analyzing the Semi-N data, N-Semi data and 
Semi-Semi data, respectively. We take a close look at the parameter estimates, 
bias, relative bias, empirical standard errors, MSEs, and CPs. 


Table 4 contains the estimation results of the N-N and Semi-N models when 
N = 200, T = 3, C = 20, $\sigma_{LS}$ = 0, and $\sigma_e^2$ = 0.5 in analyzing the 
Semi-N data. When data are generated with the measurement errors coming from 
different clusters, using the Semi-N model consistently leads to less biased 
estimates, smaller standard errors and MSEs, and better CPs. For the fixed 
effects $\beta_L$ and $\beta_S$, estimates from the N-N model and the Semi-N model 
are about the same. Standard errors are smaller for the Semi-N model. Also, CPs 
of the 95% HPD credible intervals from the Semi-N model are relatively closer to 
the nominal level 95%. For parameters $\sigma_L^2$, $\sigma_S^2$, and $\sigma_{LS}$, 
which are related to the random effects, the bias and standard errors are 
uniformly smaller when the Semi-N model is fitted to the data. Furthermore, the 
CPs for $\sigma_S^2$ and $\sigma_{LS}$ increase from 0.910 and 0.905 to 0.940 and 
0.945, respectively, moving much closer to the nominal level 95%. We notice that 
the estimates of $\sigma_e^2$ are around 0.475 for both the N-N and Semi-N models, 
the standard errors are large, and the CPs are far from 95%. This is because the 
measurement errors $e_{it}$ are generated from $N(0, \sigma_e^2)$, and the values of 
$\sigma_e^2$ are generated from $IG(2, 0.5)$ to control the mean of $\sigma_e^2$ to 
be 0.5. However, values generated from $IG(2, 0.5)$ are usually less than 0.5 
because this inverse Gamma distribution is skewed to the right. Therefore, in 
practice, we can hardly control the variance of the measurement errors when 
generating the Semi-N data, and thus the bias, MSE, and CP for $\sigma_e^2$ cannot 
be trusted for the Semi-N data, as the population parameter values are unknown. 
Note that the parameter estimates and their standard errors can still be 
trusted. For the Semi-N model, the estimated number of clusters for $e_i$ is 
about 6 and its standard error is 0.653, suggesting about 6 different clusters 
among the 200 subjects in the distribution of the measurement errors. Because we 
use informative priors for the DP precision parameter $\alpha$ to reduce the 
computational complexity and time, the estimate of $\alpha$ is very precise. The 
same pattern can be observed for all the other conditions in the comparison 
between the N-N and Semi-N models. Detailed tables under different conditions 
are available in Appendix A on our GitHub site: 
https://github.com/CynthiaXinTong/SemiparametricBayeisnGCM. 
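
The gap between the mean of $IG(2, 0.5)$ and a typical draw from it can be checked numerically. The following is a minimal sketch, assuming the shape-scale parameterization (shape 2, scale 0.5, so the mean is 0.5/(2 - 1) = 0.5); the helper name `draw_inverse_gamma` and the seed are illustrative and are not part of the original simulation code.

```python
import random
import statistics


def draw_inverse_gamma(shape, scale, rng):
    """Draw from an inverse Gamma(shape, scale) distribution.

    If X ~ Gamma(shape, rate=scale), then 1/X ~ inverse Gamma(shape, scale).
    random.gammavariate expects the Gamma *scale*, i.e. 1/rate.
    """
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


rng = random.Random(2024)
draws = [draw_inverse_gamma(2.0, 0.5, rng) for _ in range(100_000)]

# Share of simulated variances that fall below the nominal mean of 0.5.
share_below = sum(d < 0.5 for d in draws) / len(draws)

print(f"median of draws          = {statistics.median(draws):.3f}")
print(f"share of draws below 0.5 = {share_below:.3f}")
```

Under this parameterization the probability of a draw below 0.5 is 2/e, roughly 0.736, and the variance is infinite because the shape does not exceed 2. The realized error variances in any one generated data set can therefore sit well away from 0.5.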


Table 4. Parameter estimation for the N-N and Semi-N distributional models when 
data are generated from the Semi-N model with N = 200, T = 3, C = 20, 
$\sigma_{LS}$ = 0, and $\sigma_e^2$ = 0.5 

                 N-N model                                       Semi-N model 
Parameter        Est.    AB      RB(%)    SE     MSE    CP       Est.    AB      RB(%)    SE     MSE    CP 
$\beta_L$        6.201   0.001    0.009  0.082  0.007  0.960     6.201   0.001    0.008  0.081  0.007  0.955 
$\beta_S$        0.303   0.003    0.845  0.041  0.002  0.980     0.302   0.002    0.620  0.039  0.001  0.970 
$\sigma_L^2$     1.016   0.016    1.576  0.138  0.019  0.970     1.014   0.014    1.395  0.134  0.018  0.970 
$\sigma_S^2$     0.135   0.035   35.280  0.035  0.002  0.910     0.132   0.032   31.663  0.028  0.002  0.940 
$\sigma_{LS}$   -0.022  -0.022   -2.157  0.058  0.004  0.905    -0.019  -0.019   -1.899  0.053  0.003  0.945 
$\sigma_e^2$     0.475  -0.025   -5.076  0.365  0.134  0.240     0.476  -0.024   -4.835  0.364  0.133  0.215 
$K_e$            -       -        -      -      -      -         5.800   -        -      0.653  -      - 
$\alpha$         -       -        -      -      -      -         0.999  -0.001   -0.069  0.006  0.000  1.000 

Note. Est.: estimate; AB: absolute bias; RB: relative bias; SE: standard error; MSE: 
mean square error; CP: coverage probability. 


Table 5 presents the comparison between the N-N and N-Semi models when 
N = 200, T = 5, C = 20, $\sigma_{LS}$ = 0, and $\sigma_e^2$ = 0.1 in analyzing the 
N-Semi data. The parameter estimates for the fixed effects $\beta_L$ and $\beta_S$ 
are about the same for both the N-N and N-Semi models, whereas the standard 
error estimates for $\beta_L$ and $\beta_S$ are smaller for the N-Semi model, 
usually resulting in smaller CPs of the HPD intervals. Under this specific 
condition, the CPs for the N-Semi model are closer to the nominal level 95%. 
For the variance estimate of the measurement error $\sigma_e^2$, fitting the two 
models leads to similar results as well. This phenomenon is closely related to 
the estimate of $K_u$. In this analysis, the estimate of $K_u$ is 2.418, meaning 
that there are only about 2 potential clusters for the random effects. In this 
case, using the N-Semi model may not be very different from using the 
traditional growth curve model. For parameters $\sigma_L^2$, $\sigma_S^2$, and 
$\sigma_{LS}$, the bias, MSEs, and CPs cannot be trusted. The reason is similar to 
the reason why bias, MSE, and CP cannot be trusted for parameter $\sigma_e^2$ in 
analyzing the Semi-N data. Here, when the N-Semi data are generated, $u_i$ is 
generated from the multivariate normal distribution $MN(0, W)$, where 
$W = \begin{pmatrix} \sigma_L^2 & \sigma_{LS} \\ \sigma_{LS} & \sigma_S^2 \end{pmatrix}$ 
is generated from an inverse Wishart distribution 
$IW\left(4, \begin{pmatrix} 1 & 0 \\ 0 & 0.1 \end{pmatrix}\right)$ to control the mean 
of $W$ to be $\begin{pmatrix} 1 & 0 \\ 0 & 0.1 \end{pmatrix}$. In practice, it is not 
possible to generate multivariate data evenly distributed around the mean, so 
the population parameter values for $W$ are unknown, and thus we cannot 
calculate bias, MSE, and CPs for those parameters. In this analysis, we still 
use informative priors for the precision parameter $\alpha$ to reduce the 
computational time. The above pattern can be observed under the other 
conditions as well when comparing the N-N and N-Semi models (see detailed 
results in Appendix A on our GitHub site). 
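
The statement that this inverse Wishart distribution has the intended mean can be verified with the standard formula for the inverse Wishart mean. The short derivation below assumes that $IW(\nu, S)$ denotes $\nu$ degrees of freedom and a $p \times p$ scale matrix $S$; the symbols $\nu$, $S$, and $p$ are introduced here for illustration only.

```latex
E(W) = \frac{S}{\nu - p - 1}, \qquad \nu > p + 1 .
% With \nu = 4 and p = 2 the denominator is 4 - 2 - 1 = 1, so
E(W) = S = \begin{pmatrix} 1 & 0 \\ 0 & 0.1 \end{pmatrix} .
```

Second moments of the inverse Wishart exist only when $\nu > p + 3$, which does not hold for $\nu = 4$ and $p = 2$. Individual draws of $W$ can therefore lie far from this mean, which is consistent with the population values being treated as unknown.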


Table 5. Parameter estimation for the N-N and N-Semi distributional models when 
data are generated from the N-Semi model with N = 200, T = 5, C = 20, 
$\sigma_{LS}$ = 0, and $\sigma_e^2$ = 0.1 

                 N-N model                                       N-Semi model 
Parameter        Est.    AB      RB(%)    SE     MSE    CP       Est.    AB      RB(%)    SE     MSE    CP 
$\beta_L$        6.200   0.000    0.005  0.054  0.003  0.985     6.199  -0.001   -0.020  0.051  0.003  0.975 
$\beta_S$        0.299  -0.001   -0.457  0.021  0.000  0.970     0.298  -0.002   -0.699  0.019  0.000  0.965 
$\sigma_L^2$     0.836  -0.164  -16.353  1.304  1.726  0.120     0.829  -0.171  -17.113  1.299  1.715  0.050 
$\sigma_S^2$     0.094  -0.006   -6.150  0.098  0.010  0.195     0.089  -0.011  -10.798  0.098  0.010  0.055 
$\sigma_{LS}$   -0.009  -0.009   -0.919  0.244  0.060  0.345    -0.010  -0.010   -1.015  0.243  0.059  0.135 
$\sigma_e^2$     0.099  -0.001   -0.529  0.005  0.000  0.955     0.099  -0.001   -0.737  0.005  0.000  0.950 
$K_u$            -       -        -      -      -      -         2.418   -        -      0.789  -      - 
$\alpha$         -       -        -      -      -      -         0.967  -0.033   -3.309  0.008  0.001  1.000 

The comparison results between the Semi-Semi and N-N models are presented 
in Table 6 for the Semi-Semi data when N = 50, T = 3, C = 20, $\sigma_{LS}$ = 0.3, 
and $\sigma_e^2$ = 0.5. For this comparison, we can only compare the bias, standard 
error estimates, MSEs, and CPs for the fixed-effects parameters. Clearly, the 
absolute biases of the two models are close to each other, whereas the standard 
errors are consistently smaller for the Semi-Semi model than for the 
traditional N-N model, indicating that the efficiency of the estimates can be 
increased by using the robust Semi-Semi model. When generating the Semi-Semi 
data, we cannot manipulate the covariance matrix of $u_i$ and the variance of 
$e_i$ exactly. Therefore, the population parameter values of $\sigma_L^2$, 
$\sigma_S^2$, $\sigma_{LS}$, and $\sigma_e^2$ are unknown, so the bias, MSEs, and CPs 
for these parameters cannot be evaluated. In Table 6, we also observe that the 
estimate of $K_e$ is 4.501 and the estimate of $K_u$ is 2.416, implying that there 
are about 5 clusters for $e_i$ and 2 clusters for $u_i$, respectively, among the 
50 subjects. Different subjects' measurement errors are distributed 
differently, whereas their growth trajectories do not differ as much. Because 
informative priors are used for $\alpha_e$ and $\alpha_u$, their estimates are very 
precise. More comparison results between the Semi-Semi model and the N-N model 
under different conditions are available in Appendix A on 
https://github.com/CynthiaXinTong/SemiparametricBayeisnGCM. 








Table 6. Parameter estimation for the N-N and Semi-Semi distributional models when 
data are generated from the Semi-Semi model with N = 50, T = 3, C = 20, 
$\sigma_{LS}$ = 0.3, and $\sigma_e^2$ = 0.5 

                 N-N model                                       Semi-Semi model 
Parameter        Est.    AB      RB(%)    SE     MSE    CP       Est.    AB      RB(%)    SE     MSE    CP 
$\beta_L$        6.195  -0.005   -0.087  0.166  0.028  0.980     6.196  -0.004   -0.060  0.147  0.021  0.970 
$\beta_S$        0.300   0.000    0.161  0.079  0.006  0.980     0.298  -0.002   -0.526  0.073  0.005  0.980 
$\sigma_L^2$     1.098   0.098    9.841  1.258  1.592  0.425     1.051   0.051    5.126  1.220  1.491  0.295 
$\sigma_S^2$     0.247   0.147  147.283  0.300  0.112  0.710     0.217   0.117  116.946  0.151  0.037  0.635 
$\sigma_{LS}$    0.157  -0.143  -47.786  0.440  0.214  0.275     0.163  -0.137  -45.702  0.351  0.142  0.165 
$\sigma_e^2$     0.550   0.050   10.086  0.907  0.826  0.285     0.543   0.043    8.606  0.959  0.922  0.230 
$K_e$            -       -        -      -      -      -         4.501   -        -      0.420  -      - 
$K_u$            -       -        -      -      -      -         2.416   -        -      0.584  -      - 
$\alpha_e$       -       -        -      -      -      -         1.000   0.000    0.016  0.004  0.000  1.000 
$\alpha_u$       -       -        -      -      -      -         0.980  -0.020   -2.007  0.006  0.000  1.000 





In sum, the performance of the four models is about the same for normally 
distributed data, especially when the sample size is large. When the sample 
size is small, even for normal data, some semiparametric models may perform 
slightly better than the traditional N-N model in the precision of parameter 
estimation. When data are not normally distributed, the traditional N-N model 
performs relatively worse than the semiparametric models. The models may not 
yield very different parameter estimates for the fixed effects $\beta_L$ and 
$\beta_S$, but the standard errors for all parameters are smaller for the 
semiparametric models than those for the N-N model, potentially resulting in 
higher statistical power. In addition, the differences between the N-N model 
and the semiparametric models are closely related to the numbers of clusters 
$K_e$ and $K_u$, which represent the heteroscedasticity of $e_i$ and $u_i$, 
respectively. If $K_e$ or $K_u$ is much larger than 1, data are more likely to be 
nonnormal, and the differences between the results from the N-N model and the 
semiparametric models should be bigger. Theoretically, if the estimates of 
$K_e$ and $K_u$ are 1, the parameter estimates from the Semi-Semi model should be 
the same as those from the traditional N-N model. 


4.4 Results: Part II 


We have shown that the semiparametric models perform at least equally 
well as the traditional N-N growth curve model when data are normal, 
and perform better when data are nonnormal. We recommend utilizing the 
semiparametric models in practical data analyses. Because there are three 
different semiparametric models, another purpose of this simulation study is 
to evaluate the effects of the misspecification of the three types of distributional 
growth curve models. Two commonly used statistics, which examine more than 
one performance criterion (Collins et al., 2001), are calculated for each model 
parameter to compare the three types of semiparametric growth curve models. 
The first statistic is the MSE based on 200 sets of parameter estimates and 
standard errors, and the second one is the CP of the 95% HPD credible intervals. 
The MSEs and CPs are then averaged over certain model parameters for each 
simulation condition. For the Semi-N data, MSEs and CPs are averaged over 
$\beta_L$, $\beta_S$, $\sigma_L^2$, $\sigma_S^2$, and $\sigma_{LS}$, because the MSE 
and CP for $\sigma_e^2$ cannot be trusted, as explained previously. For the N-Semi 
data, MSEs and CPs are averaged over $\beta_L$, $\beta_S$, and $\sigma_e^2$ since 
the population parameter values for $\sigma_L^2$, $\sigma_S^2$, and $\sigma_{LS}$ are 
unknown. For the Semi-Semi data, MSEs and CPs are only averaged over $\beta_L$ 
and $\beta_S$. 
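
To make the two criteria concrete, the sketch below computes bias, empirical standard error, MSE, and CP for one parameter from a set of replications. It uses one common definition of these quantities (MSE as the mean squared deviation of the estimates from the population value, CP as the share of intervals covering it); the function name `summarize_replications` and its arguments are hypothetical, and the exact formulas used in this study may differ.

```python
def summarize_replications(estimates, intervals, true_value):
    """Summarize simulation replications for a single parameter.

    estimates  -- point estimates, one per simulated data set
    intervals  -- (lower, upper) credible-interval bounds, one per data set
    true_value -- population value used to generate the data
    """
    r = len(estimates)
    mean_est = sum(estimates) / r
    bias = mean_est - true_value
    # Empirical standard error: standard deviation of the estimates.
    se = (sum((e - mean_est) ** 2 for e in estimates) / (r - 1)) ** 0.5
    # Mean squared error around the population value.
    mse = sum((e - true_value) ** 2 for e in estimates) / r
    # Coverage probability: share of intervals that contain the true value.
    cp = sum(lo <= true_value <= hi for lo, hi in intervals) / r
    return {"bias": bias, "se": se, "mse": mse, "cp": cp}


if __name__ == "__main__":
    # Toy input with four replications for a slope whose true value is 0.3.
    est = [0.28, 0.31, 0.33, 0.29]
    ci = [(0.20, 0.36), (0.25, 0.37), (0.31, 0.35), (0.22, 0.36)]
    print(summarize_replications(est, ci, true_value=0.3))
```

Averaging the resulting `mse` and `cp` values over the parameters listed above for each type of data gives one summary pair per simulation condition.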

Table 7 summarizes the results for the analysis of each type of data by 
different types of distributional models with different sample sizes when T = 5, 
C = 5, $\sigma_{LS}$ = 0, and $\sigma_e^2$ = 0.1. In the table, the rows are the 
different types of generated data and the columns are the three types of 
semiparametric distributional models used to analyze the generated data. In 
almost all situations, the model used to generate the data provides the best 
estimation results, with smaller MSE and better credible interval coverage 
among the three types of robust growth curve models. For example, for the 
Semi-N data with N = 200, the Semi-N distributional model gives the best 
coverage probability and an MSE comparable to the other models. Similarly, for 
the N-Semi data with N = 50, the MSE for the N-Semi model is one of the 
smallest and the CP for the N-Semi model is the closest to the nominal level. 
Intuitively, we may consider the Semi-Semi model as the most general model and 
apply it to all the cases. However, this is not always a good idea. First, in 
our simulation results, although the MSEs for the Semi-Semi model are the 
smallest under different conditions, the CPs for the Semi-Semi model are not 
always the best. By using the Semi-Semi model, the parameter estimates are 
slightly less accurate, while the standard errors are slightly smaller. 
Unexpectedly, these slight changes in the estimates and standard errors may 
result in a substantially lower coverage probability. Thus, the Semi-Semi 
distributional growth curve model is not optimal all the time. Second, 
theoretically, although the semiparametric approach is the same as the 
traditional growth curve analysis when the numbers of clusters take the value 
of 1, the estimated numbers of clusters can almost never be exactly 1 when we 
fit a semiparametric model to normal data. This is because the number of 
clusters is counted in each iteration of the MCMC sampling procedure and is 
always at least 1. If, in any iteration, the number of clusters happens to be 
bigger than 1 due to sampling error, the estimated number of clusters cannot 
be exactly 1. Therefore, the semiparametric approach is not the same as the 
traditional growth curve analysis when analyzing normal data. One will lose 
statistical accuracy and increase type I errors by fitting the Semi-Semi 
distributional model to the N-N, Semi-N, or N-Semi data. Third, practically, 
estimating a Semi-Semi distributional model is more time-consuming than 
estimating the other types of models. It is often worth putting effort into 
determining the distributions of random effects and measurement errors to 
select the correct type of model. 
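
The second argument above can be written as a one-line inequality. The sketch below assumes that the reported estimate of the number of clusters is the average of the per-iteration counts over the retained MCMC iterations; the symbols $M$ (number of retained iterations) and $K^{(m)}$ (count in iteration $m$) are introduced here for illustration.

```latex
\hat{K} = \frac{1}{M} \sum_{m=1}^{M} K^{(m)}, \qquad K^{(m)} \in \{1, \ldots, C\}
\quad \Longrightarrow \quad
\hat{K} \ge 1, \ \text{with equality only if } K^{(m)} = 1 \text{ for every } m .
```

A single iteration with $K^{(m)} > 1$ is therefore enough to push $\hat{K}$ above 1, even when the data are normal.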

The above results hold for different sample sizes, numbers of measurement 
occasions, potential numbers of clusters, covariances between the latent 
intercept and slope, and variances of the measurement errors. Taking a closer 
look at the influence of these factors, we notice that the MSEs decrease as the 
sample size increases. By comparing Tables 7 and 8, Tables 7 and 9, Tables 
7 and 10, and Tables 7 and 11, we observe separately that the number of 
measurement occasions, the potential number of clusters, the covariance between 
the latent intercept and slope, and the variance of the measurement errors do not 
affect the performance of the semiparametric models. More tables under different 
conditions are given in Appendix B on our GitHub site: 
https://github.com/CynthiaXinTong/SemiparametricBayeisnGCM. 

In summary, the accuracy and efficiency of the estimation for a specific type 
of data closely depend on the correct specification of a model. Consequently, in 
practical data analyses, it is important to choose the correct type of model. 


Table 7. Mean squared errors and coverage probabilities for different data and 
models (T = 5, C = 5, $\sigma_{LS}$ = 0, $\sigma_e^2$ = 0.1) 

                        N = 50                         N = 200                        N = 500 
Data             Semi-N  N-Semi  Semi-Semi | Semi-N  N-Semi  Semi-Semi | Semi-N  N-Semi  Semi-Semi 
Semi-N     MSE   0.016   0.015   0.015     | 0.004   0.004   0.004     | 0.002   0.002   0.002 
           CP    0.957   0.672   0.674     | 0.953   0.677   0.686     | 0.934   0.642   0.642 
N-Semi     MSE   0.009   0.007   0.007     | 0.003   0.001   0.001     | 0.001   0.001   0.001 
           CP    0.892   0.940   0.885     | 0.882   0.943   0.893     | 0.903   0.948   0.907 
Semi-Semi  MSE   0.013   0.008   0.008     | 0.003   0.002   0.002     | 0.001   0.001   0.001 
           CP    0.965   0.973   0.970     | 0.968   0.975   0.973     | 0.943   0.960   0.955 

Note. MSE: mean square error; CP: coverage probability. The rows are the different 
types of generated data with sample size = 50, 200, and 500. The columns are the 
three types of distributional models used to analyze the generated data. For each 
type of generated data, the three distributional models are fitted. The average MSE 
and CP for certain model parameters are obtained, as displayed in the table. 


Table 9. Mean squared errors and coverage probabilities for different data and 
models (T = 5, C = 20, $\sigma_{LS}$ = 0, $\sigma_e^2$ = 0.1) 

                        N = 50                         N = 200                        N = 500 
Data             Semi-N  N-Semi  Semi-Semi | Semi-N  N-Semi  Semi-Semi | Semi-N  N-Semi  Semi-Semi 
Semi-N     MSE   0.016   0.015   0.015     | 0.004   0.004   0.004     | 0.001   0.001   0.001 
           CP    0.950   0.675   0.684     | 0.947   0.683   0.676     | 0.944   0.655   0.659 
N-Semi     MSE   0.010   0.005   0.005     | 0.034   0.001   0.001     | 0.025   0.001   0.001 
           CP    0.903   0.958   0.908     | 0.930   0.963   0.927     | 0.878   0.952   0.900 
Semi-Semi  MSE   0.009   0.007   0.007     | 0.003   0.002   0.002     | 0.001   0.001   0.001 
           CP    0.978   0.973   0.978     | 0.953   0.958   0.960     | 0.960   0.960   0.963 


Table 11. Mean squared errors and coverage probabilities for different data and 
models (T = 5, C = 5, $\sigma_{LS}$ = 0, $\sigma_e^2$ = …) 

                        N = 50                         N = 200                        N = 500 
Data             Semi-N  N-Semi  Semi-Semi | Semi-N  N-Semi  Semi-Semi | Semi-N  N-Semi  Semi-Semi 
Semi-N     MSE   0.021   0.021   0.020     | 0.006   0.006   0.006     | 0.002   0.002   0.002 
           CP    0.953   0.845   0.841     | 0.939   0.822   0.819     | 0.946   0.833   0.827 
N-Semi     MSE   0.018   0.008   0.008     | 0.003   0.002   0.002     | 0.002   0.001   0.001 
           CP    0.870   0.948   0.872     | 0.868   0.938   0.865     | 0.872   0.937   0.870 
Semi-Semi  MSE   0.015   0.011   0.011     | 0.004   0.003   0.003     | 0.001   0.001   0.001 
           CP    0.970   0.968   0.968     | 0.948   0.960   0.953     | 0.953   0.960   0.958 


4.5 Model selection 


Tong & Zhang (2012) proposed three model diagnostic methods, and the 
“distribution checking based on individual growth curve analysis” method can 
be easily adopted for the semiparametric approach. In this method, an individual 
growth curve ($y_i = \Lambda b_i + e_i$) is first fitted to data from each 
individual. Using the least squares estimation method, the individual 
coefficients (random effects) $b_i = (b_{iL}, b_{iS})'$ and the measurement errors 
$e_i = (e_{i1}, \ldots, e_{iT})'$ are estimated and retained. Let 
$\hat{b} = (\hat{b}_1, \ldots, \hat{b}_N)'$ and $\hat{e} = (\hat{e}_1, \ldots, \hat{e}_N)'$, 
where $\hat{b}$ is an N × 2 matrix of individual coefficient estimates and 
$\hat{e}$ is an N × T matrix of estimated errors. Then, we test the normality of 
$\hat{e}$ and $\hat{b}$. If both columns of $\hat{b}$ follow normal distributions, we 
consider the individual coefficients to be normally distributed. Otherwise, we 
consider them nonnormally distributed. Similarly, if all T columns of $\hat{e}$ 
are normally distributed, the errors are viewed as coming from normal 
distributions. If either $\hat{e}$ or $\hat{b}$ is not normally distributed, a 
semiparametric approach is recommended. Based on the combination of the 
distributions for $\hat{e}$ and $\hat{b}$, the decision can be made according to 
Table 12. 
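
The checking procedure can be sketched in a few lines. The example below is only an illustration under stated assumptions: it takes time scores 0, 1, ..., T - 1 for a linear growth curve, and it uses the Jarque-Bera statistic as a stand-in normality test because the text does not name a specific test. The function names `ols_line`, `jarque_bera_pvalue`, and `select_model` are hypothetical.

```python
import math
import random


def ols_line(y):
    """Least-squares intercept, slope, and residuals for one individual."""
    T = len(y)
    t = list(range(T))  # assumed time scores 0, 1, ..., T-1
    t_bar, y_bar = sum(t) / T, sum(y) / T
    sxx = sum((a - t_bar) ** 2 for a in t)
    slope = sum((a - t_bar) * (b - y_bar) for a, b in zip(t, y)) / sxx
    intercept = y_bar - slope * t_bar
    resid = [b - (intercept + slope * a) for a, b in zip(t, y)]
    return intercept, slope, resid


def jarque_bera_pvalue(x):
    """Asymptotic p-value of the Jarque-Bera normality test (stand-in test)."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    skew, kurt = m3 / m2 ** 1.5, m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return math.exp(-jb / 2.0)  # chi-square(2) survival function


def select_model(Y, level=0.05):
    """Map normality decisions for errors and coefficients to a model type."""
    fits = [ols_line(y) for y in Y]
    coef_cols = [[f[0] for f in fits], [f[1] for f in fits]]
    resid_cols = [list(col) for col in zip(*(f[2] for f in fits))]
    b_normal = all(jarque_bera_pvalue(c) > level for c in coef_cols)
    e_normal = all(jarque_bera_pvalue(c) > level for c in resid_cols)
    # Keys are (errors normal, coefficients normal), following Table 12.
    decision = {(True, True): "N-N", (False, True): "Semi-N",
                (True, False): "N-Semi", (False, False): "Semi-Semi"}
    return decision[(e_normal, b_normal)]


if __name__ == "__main__":
    rng = random.Random(7)
    # Simulated N = 200, T = 5 data with normal coefficients and errors.
    Y = []
    for _ in range(200):
        b_l, b_s = rng.gauss(6.2, 1.0), rng.gauss(0.3, 0.3)
        Y.append([b_l + b_s * t + rng.gauss(0.0, 0.3) for t in range(5)])
    print(select_model(Y))
```

The Jarque-Bera p-value is only asymptotically valid, so for small N another normality test may be preferable; the mapping from the two decisions to a model type is the part that follows Table 12.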


Table 12. Distribution checking based on individual growth curve analysis 

Errors       Individual coefficients   Model 
normal       normal                    N-N distributional model 
nonnormal    normal                    Semi-N distributional model 
normal       nonnormal                 N-Semi distributional model 
nonnormal    nonnormal                 Semi-Semi distributional model 





5 Discussion 


Restricting inference to a parametric probability family can mislead 
investigators and create a false sense of posterior certainty (Miller & Mitra, 
2004). In this study, we proposed a semiparametric Bayesian approach for growth 
curve analysis with nonnormal data. The normal distributions of the random 
effects and/or measurement errors of the traditional growth curve model were 
replaced by random distributions with DPM priors. Thus, four types of 
distributional growth curve models were discussed, including the traditional 
N-N model and the robust Semi-N, N-Semi, and Semi-Semi models. Through a 
simulation study, 
we systematically evaluated the performance of the semiparametric Bayesian 
method and further assessed the effects of the misspecification of the four types 
of distributional growth curve models to compare the intrinsic characteristics 
of them. Seven potentially influential factors were considered including type of 
data (N-N data, Semi-N data, N-Semi data, Semi-Semi data), type of model 
(N-N model, Semi-N model, N-Semi model, Semi-Semi model), number of 
measurement occasions (T = 3, 5), potential number of clusters (C = 5, 20), 
the covariance between the latent intercept and slope ($\sigma_{LS}$ = 0, 0.3), 
variance of measurement errors ($\sigma_e^2$ = 0.1, 0.5), and sample size 
(N = 50, 200, 500). 
Among the seven factors, the number of measurement occasions, the potential 
number of clusters, the covariance between the latent intercept and slope, and the 
variance of measurement errors were not influential to the comparison among the 
performance of the four types of distributional models. The following conclusions 
can be drawn for the other three factors. 

First, the three types of semiparametric models perform as well as, or better 
than, the traditional N-N model, especially when data are nonnormal. When 
data are normally distributed, we may obtain slightly biased but more efficient 
parameter estimates by using the semiparametric models. It is possible for the 
semiparametric models to lead to worse CPs, but the MSEs are often smaller. 
When data are nonnormal, we recommend using the robust models instead of 
the traditional growth curve model as they provide much more accurate and 
precise parameter estimates. Second, the semiparametric approach can improve 
the efficiency of the parameter estimation. For example, in Tables 4-6, the 
standard errors in the left panel (the N-N model) are generally larger than 
those in the right panel (the semiparametric model), indicating that the 
parameter estimation from the traditional growth curve analysis is less 
efficient. However, we would like to note that although the Semi-Semi model is 
the most general type of model, it is not always optimal. Misusing the 
Semi-Semi model could result in lower CPs and more type I errors. Moreover, 
fitting the Semi-Semi model to data is more time-consuming than fitting simpler 
models. Therefore, it is important to specify the correct type of model for 
practical data analyses. The “eyeball” method and the “distribution checking 
based on individual growth curve analysis” method can be used for model 
diagnostics (see Tong & Zhang, 2012). Third, increasing the sample size can 
often improve the performance of all four types of models. As shown in Tables 
7-11, MSEs become smaller when sample size increases, but sample size does not 
affect the comparison among the four types of models. In general, we recommend 
using robust semiparametric models, especially when nonnormality is suspected. 


For the semiparametric Bayesian approach, the normal assumption is replaced 
by a random distribution with a DPM prior. In our study, the random 
distribution is a mixture of multivariate normal distributions with the mixing 
proportions generated following certain rules (e.g., the truncated 
stick-breaking construction). So, similar to finite growth mixture modeling, 
the number of clusters increases with the sample size. This is reasonable 
because diversity increases as more subjects are enrolled in the study, and 
naturally more clusters are needed. However, semiparametric Bayesian growth 
curve modeling is different from finite growth mixture modeling. For finite 
growth mixture modeling, adding one additional cluster brings in several more 
parameters to be estimated. Thus, it is not feasible to have many clusters 
when we conduct finite growth mixture analyses, whereas it is not a problem to 
obtain a large number of clusters if we use the semiparametric Bayesian 
method. The number of parameters for the semiparametric Bayesian model stays 
the same no matter how many clusters there are. 
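
The truncated stick-breaking construction mentioned above can be illustrated as follows. This is a minimal sketch of the standard construction, in which each break is a Beta(1, α) share of the remaining stick and the last of the C components takes whatever is left; the function name `truncated_stick_breaking` is hypothetical and the code is not taken from the software used in this study.

```python
import random


def truncated_stick_breaking(alpha, C, rng=random):
    """Draw C mixing proportions from a stick-breaking prior truncated at C."""
    weights, remaining = [], 1.0
    for k in range(C):
        # The final break takes the whole remaining stick so weights sum to 1.
        v = 1.0 if k == C - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


if __name__ == "__main__":
    rng = random.Random(3)
    w = truncated_stick_breaking(alpha=1.0, C=20, rng=rng)
    print([round(x, 3) for x in w])
    print("sum of weights:", round(sum(w), 6))
```

However large C is, the construction is governed by the single precision parameter α, which is why allowing more potential clusters does not add parameters in the way it does for finite growth mixture models.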


We would like to note that the DP precision parameter $\alpha$ governs the 
expected number of clusters. Smaller values of $\alpha$ result in a smaller number 
of clusters. In this study, the DP precision parameter $\alpha$ has an informative 
prior Gamma(100, 100) to reduce the computational complexity and convergence 
issues. The values of $\alpha$ generated from the MCMC procedure are very close to 
1. When $\alpha$ equals 1, there is about 90% prior weight on between 3 and 7 
clusters (Lunn et al., 2013). Tong & Ke (2021) evaluated the effect of the 
precision parameter prior on model estimation, model convergence, and 
computation time. They recommended using informative priors for the precision 
parameter, even when the information is inaccurate. Following their 
recommendation, the informative prior Gamma(100, 100) was chosen in this study. 
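
How α drives the number of clusters can be illustrated with the standard prior expectation for a Dirichlet process without truncation: among n subjects, the expected number of occupied clusters is the sum of α/(α + i) for i = 0, ..., n - 1. The sketch below assumes this untruncated form, so it ignores the upper bound C used in the simulation, and the function name `expected_clusters` is hypothetical.

```python
def expected_clusters(alpha, n):
    """Prior mean number of occupied clusters among n subjects under DP(alpha).

    Subject i + 1 opens a new cluster with probability alpha / (alpha + i),
    so the expectation is the sum of these probabilities.
    """
    return sum(alpha / (alpha + i) for i in range(n))


if __name__ == "__main__":
    for n in (50, 200, 500):
        for alpha in (0.5, 1.0, 2.0):
            k = expected_clusters(alpha, n)
            print(f"n = {n:3d}, alpha = {alpha:.1f}: E[K] = {k:.2f}")
```

The expectation grows with both α and n, in line with the statements that smaller values of α give fewer clusters and that the number of clusters increases with the sample size.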
Limitations and future directions 


In this study, we proposed to use a random mixture distribution to replace the 
normal assumption for robustness, but the distribution of mixture components 
is still specified as normal. To be more general, the distribution of mixture 
components can be nonnormal as well. For example, it is quite possible that 
the t distribution is a better substitute, and the Gamma distribution probably 
can better accommodate the skewness in the data. Thus, the influence of the 
distribution form of the mixture components needs further evaluation. 

Note that we only compared parameter estimation across models; how well the 
models fit the data was not evaluated. The Deviance Information Criterion 
(DIC) is widely used to evaluate model fit in Bayesian analysis. Despite 
its popularity, DIC has received much criticism since it was proposed 
(Spiegelhalter et al., 2002). Celeux et al. (2006) argued that the DIC introduced 
by Spiegelhalter et al. for model assessment and model comparison was directly 
inspired by linear and generalized linear models, but it was open to different 
possible variations in the setting of models involving random effects, as in 
our robust growth curve models. A number of ways of computing DIC are 
proposed in Celeux et al. (2006), and their advantages and disadvantages are 
discussed. However, the calculation of DIC in semiparametric Bayesian analysis 
has not been studied. Thus, a more suitable way to calculate DIC should be 
investigated carefully in the future, since DIC is an important index for 
evaluating model performance. 

This study focuses on robust simple linear growth curve models for 
demonstration. However, the same methods should work for nonlinear growth 
curve models as well. The performance of more complicated semiparametric 
growth curve models (e.g., logistic and Gompertz models) can be studied in the 
future. 